In this paper we consider use of some special log-linear models and
minimum δ estimation in the multivariate classification problem posed by
Martin and Bradley (1972). We first define these models, called
log-difference models, and show that the minimum risk classification rule
depends only on a certain subset of the new parameters. We then review
minimum δ estimation, in particular the minimum δ estimator, the
approximate minimum δ estimator, and their existence properties. Two examples
are worked. The first involves detergent preference and illustrates how
extensions to the case in which not all variables are dichotomous
may be obtained through the use of orthogonal polynomials.

The second example involves infant hypoxic trauma, and many cells are
empty. The existence conditions are used to find a model for which
estimates of cell frequencies exist and are in good agreement with the
observed data.
1. Introduction


This paper serves the dual purposes of extending the classification
problem considered by Martin and Bradley (1972) and of illustrating the
uses of minimum δ estimation (Redman, 1981).

Martin and Bradley (hereafter denoted MB) consider the problem of
classifying individuals from $L$ populations, $\Pi^{(1)}, \ldots, \Pi^{(L)}$,
into $J$ categories, $C_1, \ldots, C_J$, on the basis of a random vector $Z$
which consists of $I$ dichotomous variates. They reparameterize the $2^I$
possible state probabilities as

$$\pi^{(\ell)}(z) = \pi(z)\,[1 + h(a^{(\ell)}, z)], \tag{1}$$

where $\pi^{(\ell)}(z)$ denotes the probability of state $z$ for the $\ell$th
population, $\pi(z)$ denotes the probability of state $z$ for a well-defined
composite population, and $h(a^{(\ell)}, z)$ is expressed in terms of $2^I$
orthogonal polynomials, the coefficients in $a^{(\ell)}$ being specific to the
$\ell$th population. Models arise through the approximation of
$h(a^{(\ell)}, z)$ in (1) by a set of low-order polynomial terms,
$h_s(a^{(\ell)}, z)$. Thus, models for $\pi^{(\ell)}(z)$ are of the form

$$\pi^{(\ell)}(z) = \pi(z)\,[1 + h_s(a^{(\ell)}, z)]. \tag{2}$$

In  this  paper  we  generalize  the  problem.  We  assume  that  the  various 
levels  or  categories  for  the  I  variates  define  k  states,  which  are  labeled 
consecutively.  Thus,  while  MB  define  cells  in  their  tables  by  an  I-vector 
Z,  we  simply  take  Z  to  be  a  variable  which  may  take  on  values  1,  ...,  k. 

In Section 2 we propose use of a model, called the log-difference
model, in the classification problem. As with the difference model of
MB, the log-difference model features some parameters that are general to
all $L$ populations and some that are specific to individual populations.

In  Section  3  we  review  the  classification  problem  in  detail.  In 
particular  the  minimum  risk  classification  rule  is  shown  to  depend  on 
those  parameters  specific  to  individual  populations  only. 

In Section 4 the minimum δ and approximate minimum δ estimation
procedures are introduced. These were developed due to the lack of convenient
conditions for the existence of maximum likelihood estimates in sparse data
situations. Convenient conditions for the minimum δ estimator have been
developed and are stated here. The new estimators have been shown to be
asymptotically equivalent to the maximum likelihood estimator, and so should
yield good results when sample sizes are large. Full details may be
found in Redman (1981).

Examples are given in Sections 5 and 6. The first is designed to
illustrate the use of the log-difference model when one of the variates
has ordered categories. The sample size is large, and the maximum
likelihood, minimum δ, and approximate minimum δ estimates are nearly equal. In
the second example data are sparse, and the conditions stated in Section 4
are used to find an adequate model for which a minimum δ estimate exists.

2.  The  Log-Difference  Model 

Before proceeding with the development of the log-difference model,
it is necessary to introduce some notation. The notation is similar to
that used in Redman (1981), but is somewhat simpler due to the structure
necessary in the classification problem. Throughout, we consider two
sampling situations: the independent samples case, in which independent
samples of size $N^{(\ell)}$ are available from each population, and the single
sample case, in which the data arise from a single sample of size $N$ from
all $L$ populations. We shall use an index $s = L$ for the independent samples
case and $s = 1$ for the single sample case.

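Let $n$ be the random $(kL)$-vector whose elements denote the number of
observations which fall in each of the $k$ cells of the $L$ populations.
Thus $n' = [n^{(1)\prime}, \ldots, n^{(L)\prime}]$, where

$$[n^{(\ell)}]' = [n_1^{(\ell)}, \ldots, n_k^{(\ell)}], \quad \ell = 1, \ldots, L,$$

and $n_i^{(\ell)}$ denotes the number of observations for state $i$ of the
$\ell$th population. In the independent samples case, we assume the
multinomial distribution,

$$n^{(\ell)} \sim \mathrm{Mult}\big(N^{(\ell)}, \pi^{(\ell)}\big), \quad \pi_i^{(\ell)} > 0, \quad \sum_{i=1}^{k} \pi_i^{(\ell)} = 1, \quad \ell = 1, \ldots, L,$$

and define $p' = [p^{(1)\prime}, \ldots, p^{(L)\prime}]$ through
$p_i^{(\ell)} = n_i^{(\ell)} / N^{(\ell)}$. In the single sample case, we assume

$$n \sim \mathrm{Mult}(N, \pi), \quad \pi' = [\pi^{(1)\prime}, \ldots, \pi^{(L)\prime}], \quad \pi_i^{(\ell)} > 0, \quad \sum_{\ell=1}^{L} \sum_{i=1}^{k} \pi_i^{(\ell)} = 1,$$

with $p_i^{(\ell)} = n_i^{(\ell)} / N$. Note that, in the independent samples case,
$s = L$, $\pi_i^{(\ell)}$ is interpreted as the probability of state $i$ for
population $\ell$, and, in the single sample case, $s = 1$, as the joint
probability of state $i$ and population $\ell$.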
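The two sampling schemes differ only in the denominator used to form the observed proportions. The following minimal sketch illustrates this; the function name `cell_proportions` and the counts are hypothetical and are not taken from the paper.

```python
def cell_proportions(counts, single_sample):
    """counts[l][i] holds n_i^(l); returns the observed proportions p_i^(l).

    single_sample=True  (s = 1): divide every cell by the grand total N.
    single_sample=False (s = L): divide each cell by its population total N^(l).
    """
    if single_sample:
        total = sum(sum(row) for row in counts)
        return [[n / total for n in row] for row in counts]
    return [[n / sum(row) for n in row] for row in counts]


# Hypothetical counts for L = 2 populations and k = 2 states.
counts = [[3, 1], [2, 4]]
p_joint = cell_proportions(counts, single_sample=True)   # joint proportions
p_cond = cell_proportions(counts, single_sample=False)   # within-population proportions
```

In the single sample case all $kL$ proportions sum to one; in the independent samples case each population's $k$ proportions sum to one.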

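In an analogous fashion, we define $\gamma$ and $\gamma^{(\ell)}$,
$\ell = 1, \ldots, L$, through $\gamma_i^{(\ell)} = \log \pi_i^{(\ell)}$,
$i = 1, \ldots, k$. A log-linear model is specified by $m$ orthonormal
constraints on $\gamma$, $B_1 \gamma = 0_m$, where $B_1 A(s) = 0_{m \times s}$,
and $A(s) = 1_{kL}$ when $s = 1$ and

$$A(s) = \begin{bmatrix} 1_k & & \\ & \ddots & \\ & & 1_k \end{bmatrix}_{kL \times L}$$

when $s = L$.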
We now describe a two-stage reparameterization of $\gamma$ which is useful
in the classification problem. In the first stage, $(L + 1)$ sets of $k$
parameters are defined; one set is general to all populations, and each
of the others is specific to a given population. The set specific to
population $L$ is redundant in that it is a linear function of the other
sets and is not considered further. Each of the remaining $L$ sets is
further reparameterized in the second stage. The motivation behind this
second reparameterization is the definition of the log-difference model,
in which certain linear functions of this final set may be assumed to
be zero. This permits a reduction in the number of independent parameters
to be estimated and is of particular importance when data are sparse.

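Define $k$-vectors $\gamma_g$, general parameters, and $\lambda^{(\ell)}$,
parameters specific to population $\ell$, through

$$\gamma^{(\ell)} = \gamma_g + \lambda^{(\ell)}, \quad \ell = 1, \ldots, L, \tag{3}$$

and

$$\gamma_g = \sum_{\ell=1}^{L} \gamma^{(\ell)} / L. \tag{4}$$

Since (3) and (4) imply that

$$\sum_{\ell=1}^{L} \lambda^{(\ell)} = \sum_{\ell=1}^{L} \gamma^{(\ell)} - L \gamma_g = 0_k,$$

we have

$$\lambda^{(L)} = -\sum_{\ell=1}^{L-1} \lambda^{(\ell)}, \tag{5}$$

and we need only consider $\gamma_g$ and $\lambda^{(\ell)}$, $\ell = 1, \ldots, L-1$.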
s 


The vectors $\gamma_g$ and $\lambda^{(\ell)}$, $\ell = 1, \ldots, L-1$, may be
further decomposed by means of

$$\gamma_g = X_0 \nu \tag{6}$$

and

$$\lambda^{(\ell)} = X_\ell \mu^{(\ell)}, \quad \ell = 1, \ldots, L-1, \tag{7}$$

where $X_0, \ldots, X_{L-1}$ are $k \times k$ orthonormal matrices and the
$k$-vectors $\nu$ and $\mu^{(\ell)}$, $\ell = 1, \ldots, L-1$, are new general
and specific parameters respectively. In a log-difference model, $m$
independent linear functions of these parameters may be specified to be zero
so long as they are constructed in conformity with the linear constraints of
a log-linear model as previously defined.
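The two-stage reparameterization in (3), (4), (6), and (7) can be sketched numerically as follows. This is a minimal illustration that assumes a common orthonormal matrix $X$ for every stage-two transform; the function name `reparameterize`, the 2 x 2 matrix, and the probabilities are hypothetical.

```python
import math


def reparameterize(gammas, X):
    """gammas[l] is the k-vector gamma^(l); X is a k x k orthonormal matrix
    (list of rows), assumed here to serve as X_0 = ... = X_{L-1}."""
    L, k = len(gammas), len(gammas[0])
    # Stage 1: general part (4) and population-specific parts (3).
    gamma_g = [sum(g[i] for g in gammas) / L for i in range(k)]
    lambdas = [[g[i] - gamma_g[i] for i in range(k)] for g in gammas]

    def x_transpose_times(v):
        # X is orthonormal, so gamma_g = X nu inverts to nu = X' gamma_g.
        return [sum(X[i][j] * v[i] for i in range(k)) for j in range(k)]

    # Stage 2: (6) and (7); the set for population L is redundant by (5).
    nu = x_transpose_times(gamma_g)
    mus = [x_transpose_times(lam) for lam in lambdas[:-1]]
    return gamma_g, lambdas, nu, mus


r = 1.0 / math.sqrt(2.0)
X = [[r, r], [r, -r]]                              # hypothetical orthonormal matrix
probs = [[0.2, 0.8], [0.5, 0.5], [0.6, 0.4]]       # hypothetical, s = L with L = 3
gammas = [[math.log(p) for p in row] for row in probs]
gamma_g, lambdas, nu, mus = reparameterize(gammas, X)
```

The specific parts sum to zero over populations, and $\gamma^{(1)}$ is recovered as $X(\nu + \mu^{(1)})$.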

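For a log-linear model with constraints $B_1 \gamma = 0_m$, it is necessary
that $B_1 A = 0_{m \times s}$. Suppose we wish to specify

$$B_1^* \begin{bmatrix} \nu \\ \mu^{(1)} \\ \vdots \\ \mu^{(L-1)} \end{bmatrix} = 0_m. \tag{8}$$

This is equivalent to

$$B_1^* Q \gamma = 0_m,$$

where

$$Q = \begin{bmatrix} X_0' & & & \\ & X_1' & & \\ & & \ddots & \\ & & & X_{L-1}' \end{bmatrix}
\begin{bmatrix}
(1/L) I_k & (1/L) I_k & \cdots & (1/L) I_k & (1/L) I_k \\
((L-1)/L) I_k & (-1/L) I_k & \cdots & (-1/L) I_k & (-1/L) I_k \\
\vdots & & \ddots & & \vdots \\
(-1/L) I_k & (-1/L) I_k & \cdots & ((L-1)/L) I_k & (-1/L) I_k
\end{bmatrix}, \tag{9}$$

so that $B_1^* Q$ is equivalent to $B_1$ and it is necessary to choose $B_1^*$ such that

$$B_1^* Q A = 0_{m \times s}. \tag{10}$$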


In practice one may often take

$$X_0 = \cdots = X_{L-1} = X.$$

Use of orthogonal polynomials in the specification of the matrix $X$ has
proven useful, although we leave the way open for other choices through
the use of arbitrary orthonormal matrices, $X_0, \ldots, X_{L-1}$. If appropriate
orthogonal polynomials are used, elements of $\nu$ and $\mu^{(\ell)}$,
$\ell = 1, \ldots, L-1$, may be given interpretations, for example, analogous
to linear, quadratic, and higher-order trend terms and their interactions in
the analysis of variance (Haberman, 1974), and selection of transformed
parameters to be taken to be zero for model simplification may proceed as in
that situation.

3.  Classification  Procedures 

We formalize the classification problem of Section 1. Following
Martin and Bradley (1972), its essential features are:

(i) There are a finite number $L$ of exclusive and exhaustive
populations, $\Pi^{(1)}, \ldots, \Pi^{(L)}$, from which the individual to be
classified may arise.

(ii) There are a finite number $J$ of exclusive and exhaustive
categories $C_1, \ldots, C_J$ into which individuals are to be classified.

(iii) Samples from each population, or a single sample from all
populations, are available.

(iv) An unknown individual $U \in \Pi^{(\ell)}$ with probability
$P(\Pi^{(\ell)})$, $\ell = 1, \ldots, L$, is to be classified.

(v) The classification of $U$ is to be made on the basis of its
belonging to cell $i^*$.

(vi) The loss entailed by classification of $U \in C_j$, when
$U \in \Pi^{(\ell)}$, is $L_j^{(\ell)}$, $j = 1, \ldots, J$, $\ell = 1, \ldots, L$.
These losses are taken to be finite, and may be taken to be nonnegative
without loss of generality. Conventionally, correct classification is
indicated by zero loss.

The minimum risk classification rule is: Classify $U \in C_{j^*}$ if
$R_{j^*}(i^*)$ is a minimum of $\{R_j(i^*),\ j = 1, \ldots, J\}$, where

$$\begin{aligned}
R_j(i) &= \sum_{\ell=1}^{L} L_j^{(\ell)} P(\Pi^{(\ell)}) P(i \mid \Pi^{(\ell)}) / P(i) \\
       &= \sum_{\ell=1}^{L} L_j^{(\ell)} P(i, \Pi^{(\ell)}) / P(i) \\
       &= \sum_{\ell=1}^{L} L_j^{(\ell)} P(\Pi^{(\ell)} \mid i), \quad j = 1, \ldots, J, \quad i = 1, \ldots, k,
\end{aligned} \tag{11}$$

$P(i)$ denotes the probability of state $i$, $P(i, \Pi^{(\ell)})$ denotes the
joint probability of state $i$ and population $\Pi^{(\ell)}$, and
$P(i \mid \Pi^{(\ell)})$ and $P(\Pi^{(\ell)} \mid i)$ denote conditional
probabilities. If the minimum is not unique, one may choose any $C_{j^*}$
such that $j^*$ corresponds with any one of the minimizing values of
$R_j(i^*)$ and the risk is not affected. The minimum risk is

$$r_{\min} = \sum_{i=1}^{k} P(i) \min_j \{R_j(i)\}. \tag{12}$$
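The rule in (11) and the minimum risk in (12) can be sketched as below, assuming the joint probabilities $P(i, \Pi^{(\ell)})$ are known. The function names and the numerical values are hypothetical illustrations, not data from the paper.

```python
def posterior_risks(joint, losses, i):
    """joint[l][i] = P(i, Pi^(l)); losses[j][l] = L_j^(l).
    Returns [R_1(i), ..., R_J(i)] as in (11)."""
    p_i = sum(row[i] for row in joint)
    return [sum(loss_j[l] * joint[l][i] for l in range(len(joint))) / p_i
            for loss_j in losses]


def classify(joint, losses, i):
    """Index j (0-based) of a category with minimum risk for state i."""
    risks = posterior_risks(joint, losses, i)
    return risks.index(min(risks))


def minimum_risk(joint, losses):
    """r_min as in (12)."""
    k = len(joint[0])
    return sum(sum(row[i] for row in joint) * min(posterior_risks(joint, losses, i))
               for i in range(k))


# Hypothetical joint probabilities for L = 2 populations and k = 3 states,
# with zero loss for correct classification and unit loss otherwise.
joint = [[0.10, 0.30, 0.05],
         [0.25, 0.10, 0.20]]
losses = [[0, 1],
          [1, 0]]
rule = [classify(joint, losses, i) for i in range(3)]
r_min = minimum_risk(joint, losses)
```

With 0-1 losses the rule assigns each state to the population with the larger joint probability, and `r_min` is the sum of the smaller joint probabilities.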
The classification rule above may be simplified somewhat through the
use of (3). In the single sample case, $R_j(i)$ in (11) may be expressed as

$$\begin{aligned}
R_j(i) &= \sum_{\ell=1}^{L} L_j^{(\ell)} \exp \gamma_i^{(\ell)} / P(i) \\
       &= \sum_{\ell=1}^{L} L_j^{(\ell)} \exp \gamma_{gi} \exp \lambda_i^{(\ell)} / P(i) \\
       &= \exp \gamma_{gi} \Big[ \sum_{\ell=1}^{L} L_j^{(\ell)} \exp \lambda_i^{(\ell)} \Big] / P(i).
\end{aligned} \tag{13}$$

Clearly the minimum risk classification rule depends solely on the term
in brackets in (13), an expression involving only the log-difference
parameters specific to each population.

Similarly, when independent samples from each population are available,
$\exp \gamma_i^{(\ell)} = P(i \mid \Pi^{(\ell)})$ and $R_j(i)$ may be expressed as

$$R_j(i) = \exp \gamma_{gi} \Big[ \sum_{\ell=1}^{L} L_j^{(\ell)} P(\Pi^{(\ell)}) \exp \lambda_i^{(\ell)} \Big] / P(i). \tag{14}$$

This time the minimum risk classification rule depends on the term in
brackets in (14), a slightly more complicated expression than (13), in
that the probabilities $P(\Pi^{(\ell)})$ are involved. Again the log-difference
parameters general to all populations are not involved.
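The claim that the general parameters drop out of the rule can be checked numerically. The sketch below, for the single sample case, compares the rule computed from the full risks in (11) with the rule computed from the bracketed term in (13) alone; the function names, joint probabilities, and losses are hypothetical.

```python
import math


def rule_from_full_risk(joint, losses):
    """Minimum-risk category per state, using R_j(i) from (11)."""
    L, k = len(joint), len(joint[0])
    rule = []
    for i in range(k):
        p_i = sum(row[i] for row in joint)
        risks = [sum(lj[l] * joint[l][i] for l in range(L)) / p_i for lj in losses]
        rule.append(risks.index(min(risks)))
    return rule


def rule_from_specific_parameters(joint, losses):
    """Same rule, using only the specific parameters lambda_i^(l) (bracket in (13))."""
    L, k = len(joint), len(joint[0])
    gam = [[math.log(p) for p in row] for row in joint]
    gam_g = [sum(g[i] for g in gam) / L for i in range(k)]
    lam = [[g[i] - gam_g[i] for i in range(k)] for g in gam]
    rule = []
    for i in range(k):
        bracket = [sum(lj[l] * math.exp(lam[l][i]) for l in range(L)) for lj in losses]
        rule.append(bracket.index(min(bracket)))
    return rule


# Hypothetical joint probabilities (L = 3, k = 3) and losses (J = 3).
joint = [[0.10, 0.20, 0.05],
         [0.15, 0.05, 0.10],
         [0.05, 0.10, 0.20]]
losses = [[0, 1, 2],
          [1, 0, 1],
          [2, 1, 0]]
full_rule = rule_from_full_risk(joint, losses)
lambda_rule = rule_from_specific_parameters(joint, losses)
```

The two rules agree because the factor $\exp \gamma_{gi} / P(i)$ is positive and does not depend on $j$.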
In practice, estimated classification rules are needed. Log-difference
modelling will be used and estimators of $\lambda^{(\ell)}$ used in place of
$\lambda^{(\ell)}$ in (13) and (14).


4. Minimum δ and Approximate Minimum δ Estimators

In this section, we find minimum δ and approximate minimum δ estimators
following the general procedures of Redman (1981). We also state conditions
which are of use for determination of the existence of a minimum δ estimator
in sparse data situations. These conditions will be used in the second
example of Section 6.

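The minimum δ estimator of $\gamma$ is that point $\tilde{\gamma}$ in

$$\Gamma(B_1) = \Big\{ \gamma : B_1 \gamma = 0_m;\ \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} = 1,\ \ell = 1, \ldots, L,\ s = L,\ \text{or}\ \sum_{\ell=1}^{L} \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} = 1,\ s = 1 \Big\}$$

which minimizes

$$\delta(\gamma; n) = \sum_{\ell=1}^{L} \sum_{i=1}^{k} n_i^{(\ell)} \big( \log p_i^{(\ell)} - \gamma_i^{(\ell)} \big)^2.$$

The function $\delta$ is a transformed $\chi^2$-like function, whose use is
motivated by the linear nature of some of the constraints on $\gamma$
expressed in the parameter space $\Gamma(B_1)$. The minimum δ estimate, which
must satisfy $B_1 \gamma = 0_m$, may be found through solution of the
following system of equations:

$$B_2 [ N (\log p - \gamma) - \psi(\gamma) ] = 0_{kL-m}$$

and

$$\sum_{i=1}^{k} \exp \gamma_i^{(\ell)} = 1, \quad \ell = 1, \ldots, L, \quad s = L,$$

or

$$\sum_{\ell=1}^{L} \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} = 1, \quad s = 1. \tag{15}$$

Here, $B_2$ is an orthocomplement of $B_1$, that is,

$$\begin{bmatrix} B_1 \\ B_2 \end{bmatrix} \begin{bmatrix} B_1' & B_2' \end{bmatrix} = I_{kL};$$

$N$ is the diagonal matrix with entries
$n_1^{(1)}, \ldots, n_k^{(1)}, \ldots, n_1^{(L)}, \ldots, n_k^{(L)}$; and
$[\psi(\gamma)]' = [\psi_1^{(1)}(\gamma), \ldots, \psi_k^{(1)}(\gamma), \ldots, \psi_1^{(L)}(\gamma), \ldots, \psi_k^{(L)}(\gamma)]$, where

$$\psi_i^{(\ell)}(\gamma) = \exp \gamma_i^{(\ell)} \sum_{i'=1}^{k} n_{i'}^{(\ell)} \big( \log p_{i'}^{(\ell)} - \gamma_{i'}^{(\ell)} \big), \quad s = L,$$

and

$$\psi_i^{(\ell)}(\gamma) = \exp \gamma_i^{(\ell)} \sum_{\ell'=1}^{L} \sum_{i'=1}^{k} n_{i'}^{(\ell')} \big( \log p_{i'}^{(\ell')} - \gamma_{i'}^{(\ell')} \big), \quad s = 1,$$

$i = 1, \ldots, k$, $\ell = 1, \ldots, L$.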
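One way to see where the correction term in the system (15) comes from is a Lagrange-multiplier argument. The sketch below, for the case $s = L$, is a reconstruction under that assumption and is not quoted from Redman (1981); the multipliers $\kappa$ and $\eta_\ell$ and the symbol $\psi$ for the correction term are notation adopted here.

```latex
\begin{align*}
\mathcal{L}(\gamma, \kappa, \eta)
  &= \sum_{\ell=1}^{L} \sum_{i=1}^{k} n_i^{(\ell)} \bigl( \log p_i^{(\ell)} - \gamma_i^{(\ell)} \bigr)^2
     + \kappa' B_1 \gamma
     + \sum_{\ell=1}^{L} \eta_\ell \Bigl( \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} - 1 \Bigr), \\
\frac{\partial \mathcal{L}}{\partial \gamma_i^{(\ell)}} = 0
  &\;\Longrightarrow\;
  -2 n_i^{(\ell)} \bigl( \log p_i^{(\ell)} - \gamma_i^{(\ell)} \bigr)
  + (B_1' \kappa)_i^{(\ell)} + \eta_\ell \exp \gamma_i^{(\ell)} = 0 .
\end{align*}
% Summing over i within population l, and using A'B_1' = 0 together with
% sum_i exp(gamma_i^(l)) = 1, gives the multiplier
\begin{equation*}
\eta_\ell = 2 \sum_{i=1}^{k} n_i^{(\ell)} \bigl( \log p_i^{(\ell)} - \gamma_i^{(\ell)} \bigr).
\end{equation*}
% Premultiplying the stationarity equations by B_2 removes kappa (B_2 B_1' = 0):
\begin{equation*}
B_2 \bigl[ N (\log p - \gamma) - \psi(\gamma) \bigr] = 0_{kL-m},
\qquad
\psi_i^{(\ell)}(\gamma) = \exp \gamma_i^{(\ell)}
  \sum_{i'=1}^{k} n_{i'}^{(\ell)} \bigl( \log p_{i'}^{(\ell)} - \gamma_{i'}^{(\ell)} \bigr).
\end{equation*}
```

The case $s = 1$ follows in the same way with a single multiplier and the double sum over $\ell$ and $i$.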

The approximate minimum δ method is an ad hoc variant of the minimum
δ method. The approximate minimum δ estimator is denoted by $\tilde{\gamma}_a$, is easy
to compute, and may serve as an initial value for iterative solution of
equations (15).

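Two steps are required for the calculation of $\tilde{\gamma}_a$. The first involves
minimization of $\delta(\gamma; n)$ over $\{\gamma : B_1 \gamma = 0_m\}$. This is
the classic minimization of a weighted sum of squares function over an affine
space and the set of vectors which yield the desired minimum is given by

$$\bar{\Gamma} = \big\{ \gamma : \gamma = B_2' \big[ (B_2 N B_2')^{-} B_2 N \log p + \big( I - (B_2 N B_2')^{-} (B_2 N B_2') \big) z \big],\ z \in E^{kL-m} \big\}.$$

Note that $(B_2 N B_2')^{-}$ is a generalized inverse of $B_2 N B_2'$. When
$B_2 N B_2'$ is nonsingular,

$$\bar{\gamma} = B_2' (B_2 N B_2')^{-1} B_2 N \log p. \tag{16}$$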

The second step in calculation of $\tilde{\gamma}_a$ adjusts $\bar{\gamma}$ so the
resultant estimated probabilities sum to one. Thus

$$\tilde{\gamma}_{a,i}^{(\ell)} = \bar{\gamma}_i^{(\ell)} - \log c_\ell(\bar{\gamma}), \quad i = 1, \ldots, k, \quad \ell = 1, \ldots, L, \tag{17}$$

where

$$c_\ell(\gamma) = \sum_{i=1}^{k} \exp \gamma_i^{(\ell)}, \quad \ell = 1, \ldots, L, \quad s = L,$$

or

$$c_\ell(\gamma) = c(\gamma) = \sum_{\ell=1}^{L} \sum_{i=1}^{k} \exp \gamma_i^{(\ell)}, \quad s = 1.$$

Due to the nature of the set $\bar{\Gamma}$, an approximate minimum δ estimator always
exists, and the estimator is unique if and only if $B_2 N B_2'$ is nonsingular.
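The two steps, (16) followed by (17), can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' program: it assumes $B_2 N B_2'$ is nonsingular, takes $n \log p = 0$ in empty cells, and uses hypothetical names (`solve`, `approx_min_delta`), a hypothetical 2 x 2 table, and a no-interaction model chosen only for the example.

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]


def approx_min_delta(counts, B2, blocks):
    """counts: flat list of the kL cell counts; B2: orthonormal rows spanning
    {gamma : B1 gamma = 0}; blocks: index lists over which probabilities sum to
    one (all cells when s = 1, one list per population when s = L)."""
    m = len(counts)
    logp = [0.0] * m
    for blk in blocks:
        tot = sum(counts[c] for c in blk)
        for c in blk:
            if counts[c] > 0:          # weight n is zero in an empty cell
                logp[c] = math.log(counts[c] / tot)
    # Step 1, equation (16): weighted least squares within the model space.
    G = [[sum(u[c] * counts[c] * v[c] for c in range(m)) for v in B2] for u in B2]
    rhs = [sum(u[c] * counts[c] * logp[c] for c in range(m)) for u in B2]
    beta = solve(G, rhs)
    gam = [sum(b * row[c] for b, row in zip(beta, B2)) for c in range(m)]
    # Step 2, equation (17): subtract log c so that probabilities sum to one.
    for blk in blocks:
        log_c = math.log(sum(math.exp(gam[c]) for c in blk))
        for c in blk:
            gam[c] -= log_c
    return gam


# Hypothetical single-sample table, cells ordered (l=1,i=1), (1,2), (2,1), (2,2);
# the model omits the state-by-population interaction contrast (1,-1,-1,1)/2.
h = 0.5
B2 = [[h, h, h, h], [h, -h, h, -h], [h, h, -h, -h]]
counts = [10, 20, 30, 40]
gamma_a = approx_min_delta(counts, B2, blocks=[[0, 1, 2, 3]])
```

Because the normalizing step subtracts a constant within each block, and $B_1 A = 0$, the adjusted vector still satisfies the linear constraints of the model.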

The set $\bar{\Gamma}$ is also intimately related to the existence of the minimum
δ estimator. Let

$$\Gamma^* = \Big\{ \gamma : \gamma \in \bar{\Gamma};\ \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} = c_\ell,\ \ell = 1, \ldots, L,\ s = L;\ \sum_{\ell=1}^{L} \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} = c,\ s = 1 \Big\},$$

where

$$c_\ell = \inf_{\gamma \in \bar{\Gamma}} \{ c_\ell(\gamma) \} = \inf_{\gamma \in \bar{\Gamma}} \Big\{ \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} \Big\}, \quad \ell = 1, \ldots, L, \quad s = L,$$

or

$$c = \inf_{\gamma \in \bar{\Gamma}} \{ c(\gamma) \} = \inf_{\gamma \in \bar{\Gamma}} \Big\{ \sum_{\ell=1}^{L} \sum_{i=1}^{k} \exp \gamma_i^{(\ell)} \Big\}, \quad s = 1.$$

The set $\Gamma^*$ is either empty or a singleton. The following conditions have
been obtained by Redman (1981):


Condition 1: A minimum δ estimator exists if and only if $\Gamma^*$ is a singleton.

Condition 2: If $B_2 N B_2'$ is nonsingular, then a minimum δ estimator
exists.

Condition 3: If all cells contain at least one observation, then a
minimum δ estimator exists.


We conclude this section with the remark that, whenever a minimum δ estimator
exists, it is unique.
5. Detergent Preference

Our first example involves detergent preference. The data, given in
Table 1, were collected in a single sample, $s = 1$, by Ries and Smith (1963),
and have been previously analyzed by Goodman (1971), Bishop, Fienberg,
and Holland (1975), and others.

The data result from an experiment in which 1008 people expressed
their preferences for two brands of detergent, X and M. They responded
also to three questions corresponding to three variables:

1. Previous experience with brand M: yes or no,

2. Water hardness: soft, medium, or hard,

3. Water temperature: high or low.

Respondents were taken to represent two populations, $\Pi^{(1)}$: consumers who
prefer X, and $\Pi^{(2)}$: consumers who prefer M. We assume that the levels
of variable 2 are ordered and equally spaced, and we take $X_0 = X_1$ with
orthonormal columns proportional to the columns of the following matrix:
[12 x 12 matrix of orthogonal-polynomial contrasts; entries not legible in this copy]

This matrix has been constructed with the aid of the orthogonal
polynomials of Fisher and Yates (1953). Interpretations which may be given
to the elements of $\nu$ and $\mu^{(1)}$ are given in Table 2.
Table 2. -- Interpretations Which May Be Given to
Parameters in Detergent Example

General       Specific            Interpretation

$\nu_1$       $\mu_1^{(1)}$       overall mean
$\nu_2$       $\mu_2^{(1)}$       main effect, var. 1
$\nu_3$       $\mu_3^{(1)}$       linear term, main effect, var. 2
$\nu_4$       $\mu_4^{(1)}$       quadratic term, main effect, var. 2
$\nu_5$       $\mu_5^{(1)}$       main effect, var. 3
$\nu_6$       $\mu_6^{(1)}$       var. 1 by linear term var. 2 interaction
$\nu_7$       $\mu_7^{(1)}$       var. 1 by quadratic term var. 2 interaction
$\nu_8$       $\mu_8^{(1)}$       var. 1 by var. 3 interaction
$\nu_9$       $\mu_9^{(1)}$       linear term var. 2 by var. 3 interaction
$\nu_{10}$    $\mu_{10}^{(1)}$    quadratic term var. 2 by var. 3 interaction
$\nu_{11}$    $\mu_{11}^{(1)}$    var. 1 by linear term var. 2 by var. 3 interaction
$\nu_{12}$    $\mu_{12}^{(1)}$    var. 1 by quadratic term var. 2 by var. 3 interaction
The authors cited have fitted a variety of models to these data. We
use the model in which

$$\nu' = (\nu_1, \ldots, \nu_5, 0, 0, 0, \nu_9, \nu_{10}, 0, 0), \qquad [\mu^{(1)}]' = [\mu_1^{(1)}, \mu_2^{(1)}, 0_{10}'], \tag{18}$$

because the previous work suggests that this model should fit the data
reasonably well and because it involves few parameters, particularly those
specific to population $\Pi^{(1)}$.

The estimated frequencies derived from the maximum likelihood estimate
$\hat{\gamma}$, from $\tilde{\gamma}_a$, and from $\tilde{\gamma}$ are reported in
Table 1. Computation of $\hat{\gamma}$ was effected through use of CONTAB (Zahn, 1974),
and computation of $\tilde{\gamma}_a$ through use of (16) and (17). Computation of
$\tilde{\gamma}$ was effected through solution of equations (15) by means of
Newton's Method (Acton, 1970), with $\tilde{\gamma}_a$ the initial value. The
left-hand side of each of equations (15), evaluated at the first iterate, was
less than $10^{-7}$, so the first iterate was taken to be the minimum δ estimator.

The goodness-of-fit statistics are $\chi^2(\hat{\gamma}; p) = 16.7265$, where
$\chi^2(\gamma; p)$ is the usual Pearson statistic, and
$\delta(\tilde{\gamma}_a; n) = \delta(\tilde{\gamma}; n) = 16.5904$. Under
the model, all three statistics are approximately distributed as $\chi^2_{15}$
(Redman, 1981). The values of these statistics are only slightly above
expectation, indicating good fits of the model by all three methods.

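For classification purposes we take populations $\Pi^{(1)}$ and $\Pi^{(2)}$ to
coincide with categories $C_1$ and $C_2$ and
$L_1^{(1)} = L_2^{(2)} = 0$, $L_1^{(2)} = L_2^{(1)} = 1$.
For these losses, the classification rule is simplified. Now, we have,
from (13) and (5),

$$R_1(i^*) \propto \sum_{\ell=1}^{2} L_1^{(\ell)} \exp \lambda_{i^*}^{(\ell)} = \exp \lambda_{i^*}^{(2)} = 1 / \exp \lambda_{i^*}^{(1)}$$

and

$$R_2(i^*) \propto \sum_{\ell=1}^{2} L_2^{(\ell)} \exp \lambda_{i^*}^{(\ell)} = \exp \lambda_{i^*}^{(1)},$$

the constants of proportionality being the same. The classification rule
reduces to: Classify $U \in C_1$ if $\lambda_{i^*}^{(1)} \ge 0$, and $U \in C_2$
otherwise. For the model used here, we have assumed

$$\lambda_i^{(1)} = \mu_1^{(1)} x_{i,1} + \mu_2^{(1)} x_{i,2}, \quad i = 1, \ldots, k,$$

and the minimum δ estimates of $\mu_1^{(1)}$ and $\mu_2^{(1)}$ are
$\tilde{\mu}_1^{(1)} = 0.0043$, $\tilde{\mu}_2^{(1)} = -0.7062$.
Thus, the estimated classification rule is: Classify $U \in C_1$ if
$\tilde{\lambda}_{i^*}^{(1)} = (0.0043\, x_{i^*,1} - 0.7062\, x_{i^*,2}) \ge 0$,
where $x_{ij}$ denotes the $(i, j)$-element of $X_1$, and classify
$U \in C_2$ otherwise.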

Similarly, the minimum risk in (12) reduces to the minimum probability
of misclassification,

$$r_{\min} = \sum_{i=1}^{k} P(i) \min \{ P(\Pi^{(1)} \mid i),\ P(\Pi^{(2)} \mid i) \}.$$

To estimate this probability, we simplify

$$\begin{aligned}
r_{\min} &= \sum_{i=1}^{k} P(i) \min \{ P(i, \Pi^{(1)}) / P(i),\ P(i, \Pi^{(2)}) / P(i) \} \\
         &= \sum_{i=1}^{k} \min \{ P(i, \Pi^{(1)}),\ P(i, \Pi^{(2)}) \},
\end{aligned} \tag{19}$$

and make use of Table 1, from which estimates of $P(i, \Pi^{(\ell)})$ may be obtained.
For instance, for state 1 the minimum δ estimators are

$$\tilde{\pi}_1^{(1)} = \tilde{P}(1, \Pi^{(1)}) = 21.35 / 1008 = 0.021$$

and

$$\tilde{\pi}_1^{(2)} = \tilde{P}(1, \Pi^{(2)}) = 28.43 / 1008 = 0.028.$$

The estimated contribution to (19) for $i = 1$ is 0.021. The estimate of
$r_{\min}$ is 0.425 for $\tilde{\gamma}$ and $\tilde{\gamma}_a$, it is 0.429 for
$\hat{\gamma}$, and, for all three estimators,
the proportion of individuals in the study that would be misclassified is
0.429. These figures compare favorably with the estimate of the probability
of misclassification, 0.421, based on the full model but are by no means
impressive.
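The arithmetic behind (19) is a sum, over states, of the smaller of the two estimated joint probabilities. The sketch below reproduces the state 1 contribution from the two estimated frequencies quoted in the text (21.35 and 28.43 out of 1008); the function name and the small two-state table are hypothetical.

```python
def min_misclassification_probability(joint):
    """joint[l][i] = estimate of P(i, Pi^(l)) for L = 2 populations;
    implements (19) for 0-1 losses."""
    return sum(min(a, b) for a, b in zip(joint[0], joint[1]))


n_total = 1008
# State 1 of the detergent example, using the estimated frequencies quoted in the text.
contribution_state_1 = min(21.35 / n_total, 28.43 / n_total)

# Hypothetical two-state joint distribution, for illustration only.
toy_joint = [[0.30, 0.15],
             [0.20, 0.35]]
toy_r_min = min_misclassification_probability(toy_joint)
```

Rounded to three decimals, `contribution_state_1` is 0.021, the value reported for $i = 1$.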

One may even wonder if inclusion of $\mu_1^{(1)}$ and $\mu_2^{(1)}$ in the
model aids in the fit of the data. To examine this, we assume (18) and test

$$H_0: \mu^{(1)} = 0_{12}$$

against

$$H_a: [\mu^{(1)}]' = [\mu_1^{(1)}, \mu_2^{(1)}, 0_{10}'].$$

Under $H_0$, $\delta(\tilde{\gamma}_0; n) - \delta(\tilde{\gamma}; n)$ is
approximately distributed as $\chi^2_2$, where $\tilde{\gamma}_0$
is the minimum δ estimate of $\gamma$ under $H_0$. We obtain 20.6404 for this
statistic, a value which indicates rejection of the null hypothesis and
suggests that inclusion of $\mu_1^{(1)}$ and $\mu_2^{(1)}$ significantly
improves the fit of the model to the data. This test demonstrates that the
two populations are indeed different under the restricted model defined in
(18). Classification should be possible.
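For a chi-square reference distribution with 2 degrees of freedom the upper tail probability has the closed form $\exp(-x/2)$, so the strength of the evidence in the reported statistic 20.6404 can be checked directly. The sketch below is illustrative; the function name is hypothetical and the 5 percent level is chosen only as a familiar benchmark.

```python
import math


def chi2_2df_upper_tail(x):
    """P(chi-square with 2 degrees of freedom > x) = exp(-x / 2)."""
    return math.exp(-x / 2.0)


p_value = chi2_2df_upper_tail(20.6404)      # statistic reported in the text
critical_5pct = -2.0 * math.log(0.05)       # value exceeded with probability 0.05
```

The statistic is far above the 5 percent critical value of about 5.99, which agrees with the rejection of the null hypothesis described above.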


6.  Hypoxic  Trauma 

The second example involves data used by MB¹ and is concerned
with history and behavior of infants following hypoxic trauma². For these
data $s = 1$, $I = 4$; all variables are dichotomous:

1. race, white or nonwhite,

2. suggestive or nonsuggestive medical history of mother,

3. infant first breath before or after five seconds,

4. infant first cry before or after 30 seconds.

The populations are $\Pi^{(1)}$: infants with Apgar scores³ of seven or below and
$\Pi^{(2)}$: infants with normal Apgar scores. The data are in Table 3.

We take $X_0 = X_1$ with orthonormal columns proportional to the columns
of the following matrix:

     1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
     1   1   1   1  -1   1   1  -1   1  -1  -1   1  -1  -1  -1  -1
     1   1   1  -1   1   1  -1   1  -1   1  -1  -1   1  -1  -1  -1
     1   1   1  -1  -1   1  -1  -1  -1  -1   1  -1  -1   1   1   1
     1   1  -1   1   1  -1   1   1  -1  -1   1  -1  -1   1  -1  -1
     1   1  -1   1  -1  -1   1  -1  -1   1  -1  -1   1  -1   1   1
     1   1  -1  -1   1  -1  -1   1   1  -1  -1   1  -1  -1   1   1
     1   1  -1  -1  -1  -1  -1  -1   1   1   1   1   1   1  -1  -1
     1  -1   1   1   1  -1  -1  -1   1   1   1  -1  -1  -1   1  -1
     1  -1   1   1  -1  -1  -1   1   1  -1  -1  -1   1   1  -1   1
     1  -1   1  -1   1  -1   1  -1  -1   1  -1   1  -1   1  -1   1
     1  -1   1  -1  -1  -1   1   1  -1  -1   1   1   1  -1   1  -1
     1  -1  -1   1   1   1  -1  -1  -1  -1   1   1   1  -1  -1   1
     1  -1  -1   1  -1   1  -1   1  -1   1  -1   1  -1   1   1  -1
     1  -1  -1  -1   1   1   1  -1   1  -1  -1  -1   1   1   1  -1
     1  -1  -1  -1  -1   1   1   1   1   1   1  -1  -1  -1  -1   1


¹ MB reference unpublished data of Joan C. Martin and Celia Larnoer,
Duke University.

² Hypoxic trauma: Damage to an infant during or shortly after birth
caused by oxygen deficiency.

³ Apgar score: An index of the level of physiological functioning based
on symptoms of the infant observed shortly after birth. See Apgar (1953).
This matrix has been derived from MB (Equation 2.1). The parameter
associated with the first column of this matrix may be interpreted as an
overall mean, those associated with columns 2-5 may be interpreted as
main effects of variables 1-4 respectively, and parameters associated
with succeeding columns as interactions.
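A matrix with this column structure (a constant, four main effects, then interactions of increasing order) can be generated as products of +1/-1 codes. The sketch below is one such construction and is not claimed to reproduce MB's Equation 2.1 entry for entry; the function name and the row and column ordering are assumptions made for illustration.

```python
import math
from itertools import combinations, product


def factorial_contrasts(n_vars):
    """Rows: the 2**n_vars states, coded +1/-1 for each variable.
    Columns: one product of codes per subset of variables, ordered by subset
    size (constant, main effects, two-way interactions, ...)."""
    subsets = [s for size in range(n_vars + 1)
               for s in combinations(range(n_vars), size)]
    rows = [[math.prod(z[v] for v in s) for s in subsets]
            for z in product((1, -1), repeat=n_vars)]
    return rows, subsets


rows, subsets = factorial_contrasts(4)
# Dividing by sqrt(16) = 4 makes the columns orthonormal.
X = [[entry / 4.0 for entry in row] for row in rows]
```

Distinct columns of `rows` are orthogonal, and each column has squared length 16, so `X` satisfies $X'X = I_{16}$.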

Due to the many empty cells in this data set, the possibility that
minimum δ estimators do not exist for many models is of concern. Therefore,
a preliminary study to identify a reasonable model for which a minimum δ
estimator exists had to be undertaken. The results of this preliminary
examination of the data are contained in Table 4. In step 1, it was
determined that a minimum δ estimator does not exist for the model in
which elements of $\nu$ and $\mu^{(1)}$ corresponding to main effects were
included (Model 1). If $\mu_4^{(1)}$ is deleted from Model 1, a minimum δ
estimator does exist (Model 2). Elements $\mu_1^{(1)}$, $\mu_2^{(1)}$,
$\mu_3^{(1)}$, and $\mu_5^{(1)}$ and no other specific
terms are included in succeeding models. The existence of a minimum δ
estimator for Model 3, which involves elements of $\nu$ corresponding to main
effects and first-order interactions, was checked next. No minimum δ
estimator exists for this model, and so, in the final step, the existence
of minimum δ estimators was checked for models in which general main effects
and three first-order interaction terms involving one of the variables were
included (Models 4-7, variables 1-4).

Minimum δ estimators exist for Models 2, 4, and 7. Approximate minimum
δ estimators were calculated for these models and the values of the
corresponding $\delta(\tilde{\gamma}_a; n)$ are given in Table 4. The value of
$\delta(\tilde{\gamma}_a; n)$ for Model 7
appears to be substantially lower than for the other models and so Model 7
was selected as our model.


Computation of $\tilde{\gamma}$ was again effected iteratively by means of Newton's
Method with use of $\tilde{\gamma}_a$ as an initial estimate. After four iterations,
$\delta(\tilde{\gamma}; n) = 3.2967$ was obtained. Estimated frequencies based on
$\tilde{\gamma}$ and $\tilde{\gamma}_a$
are given in Table 3. With the exception of a few zero cells, the observed
and expected frequencies appear to agree rather well. Observed and expected
frequencies agree in the zero cells more closely for the MB model, but it
should be noted that the MB model uses nine more parameters than the present
model.

Again we take $L_1^{(1)} = L_2^{(2)} = 0$, $L_1^{(2)} = L_2^{(1)} = 1$. As in the
previous example, the classification rule depends only on $\lambda^{(1)}$, and
the probability of misclassification is given by (19). For Model 7,

$$\lambda_i^{(1)} = \mu_1^{(1)} x_{i,1} + \mu_2^{(1)} x_{i,2} + \mu_3^{(1)} x_{i,3} + \mu_5^{(1)} x_{i,5}, \quad i = 1, \ldots, k,$$

and $\tilde{\mu}_1^{(1)} = 1.0508$, $\tilde{\mu}_2^{(1)} = -0.7479$,
$\tilde{\mu}_3^{(1)} = 0.8926$, and $\tilde{\mu}_5^{(1)} = 2.3272$.
Thus, the estimated classification rule is: Classify $U \in C_1$ if
$\tilde{\lambda}_{i^*}^{(1)} = (1.0508\, x_{i^*,1} - 0.7479\, x_{i^*,2} + 0.8926\, x_{i^*,3} + 2.3272\, x_{i^*,5}) \ge 0$,
and classify $U \in C_2$ otherwise. The estimates of the probability of
misclassification are 0.375 for $\tilde{\gamma}_a$ and 0.377 for $\tilde{\gamma}$,
and, for both estimators,
the proportion of individuals that would be misclassified is 0.379. This
proportion matches the proportion misclassified for the MB model.
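An estimated rule of this kind can be evaluated for all 16 states at once. The sketch below makes two assumptions that should be checked against the original report: that the four estimates quoted in the text attach to columns 1, 2, 3, and 5 of $X_1$ (the constant and the main effects of variables 1, 2, and 4), and that states are coded +1/-1 per variable with columns scaled by 1/4. Which level of each variable is coded +1 is not recoverable here, so the states are left as code vectors.

```python
from itertools import product

# Estimates quoted in the text; the column attribution is an assumption.
MU_CONSTANT = 1.0508                               # assumed coefficient of column 1
MU_MAIN = {1: -0.7479, 2: 0.8926, 4: 2.3272}       # assumed: variable number -> estimate


def lambda_tilde(z):
    """Estimated specific parameter for the state with +1/-1 codes z = (z1, z2, z3, z4).
    Each column of X_1 is assumed to be the +1/-1 column divided by 4."""
    return (MU_CONSTANT + sum(mu * z[v - 1] for v, mu in MU_MAIN.items())) / 4.0


# Category 1 when the estimated lambda is nonnegative, category 2 otherwise.
rule = {z: (1 if lambda_tilde(z) >= 0 else 2) for z in product((1, -1), repeat=4)}
n_category_1 = sum(1 for c in rule.values() if c == 1)
```

Under these assumptions the sign of the estimate is driven mainly by the code of variable 4, whose coefficient is the largest in magnitude.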
Key words: Classification problem, log-difference models, minimum δ estimation, existence.



